{
 "metadata": {
  "name": "",
  "signature": "sha256:a4569d46c1bac47e82865be891eca8dde7b63770994ed02428caba9b91622db2"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Outline\n",
      "* Motivate the use of the Binomial and Poisson distributions in social media analysis\n",
      "* Explore an important hypothesis test: the p-value\n",
      "* Address shortcomings"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# The Binomial and Poisson distributions"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Goal:** demonstrate why many events in social media should be Poisson distributed.\n",
      "\n",
      "Consider a set of trials where, for each trial, something happens (a \"success\") or does not happen (a \"failure\").\n",
      "\n",
      "$p = $ probability of a success  \n",
      "$1-p = $ the probability of a failure  \n",
      "$N = $ number of trials\n",
      "\n",
      "The Binomial distribution is a function of the discrete variable $k$. Its value $B(k;p,N)$ give the probabilty of observing $k$ successes in $N$ trials, given success probability $p$. \n",
      "\n",
      "$B(k;p,N) = \\frac{N!}{k!(N-k)!}p^k(1-p)^{N-k}$\n",
      "\n",
      "A series of coin flips is a process whose distribution of outcomes is described by the Binomial distribution with $p=0.5$.\n",
      "\n",
      "Let's start by looking at the exact distribution. Remember that this is a function of a _discrete_ variable."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# matplotlib plots are placed inline\n",
      "%matplotlib inline  \n",
      "\n",
      "# standard matplotlib and numpy imports\n",
      "import matplotlib.pyplot as plt\n",
      "import numpy as np\n",
      "\n",
      "# use scipy.stats to define the distribution\n",
      "import scipy\n",
      "from scipy import stats\n",
      "\n",
      "N = 10 # number of coin flips in a set\n",
      "p = 0.5 # probability of head\n",
      "\n",
      "x = scipy.linspace(0,N,N+1) # create bins\n",
      "pmf = scipy.stats.binom.pmf(x,N,p) \n",
      "# \"pmf\" => probability mass function, which (in a snowstorm of ill-used vocabulary) is in this case what we would usually call\n",
      "# the probability density function\n",
      "\n",
      "plt.bar(x,pmf)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note the unit normalization in the plot above.\n",
      "\n",
      "Now let's look at a histogram made from values randomly draw from the exact distribution. This histograms represents, in this example, a finite-sized set of coin flips."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "N = 10 # number of trials (\"coin-flips\") in a set\n",
      "p = 0.5 # probability of success (\"heads\")\n",
      "size = 50 # number of sets of coin flips\n",
      "\n",
      "np.random.binomial(N,p,size) # number of heads in a set of coin flips"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# how many sets of trials are required to make the approximation appear visually identical to the exact distribution?\n",
      "size = 10\n",
      "\n",
      "data = np.random.binomial(N,p,size)\n",
      "n_bins = N\n",
      "binned_data, bins, patches = plt.hist(data, n_bins, range = (0,10), normed=True)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's go back to the exact distribution and mess with the parameters. Be sure to adjust both the success probability and the number of trials."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "N = 100 # number of trials in a set\n",
      "p = 0.5 # probability of success \n",
      "\n",
      "x = scipy.linspace(0,N,N+1) # create bins\n",
      "pmf = scipy.stats.binom.pmf(x,N,p)\n",
      "plt.bar(x,pmf)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "\n",
      "**Question:** Do you expect this distribution represent the behavior of social media variables?\n",
      "\n",
      "<br>\n",
      "\n",
      "Now consider the limit where $p\\to 0$ and $N\\to\\infty$, but where $pN = \\nu$ stays constant. \n",
      "\n",
      "In this case, $\\nu$ is the *expected* number of \"successes\", or the number of any \"thing\" in a _large_ data set that appears with low relative frequency. This value $\\nu$ is the single parameter, called the _mean_, of the Poisson distribution. This distribution describes many phenomena, including many variables in social data.\n",
      "\n",
      "* 1 hour or tweets from the firehose $\\longrightarrow$ large $N$  \n",
      "* the probability of ProfileLocation = \"Boulder, Colorado, United States\" $\\longrightarrow$ small $p$\n",
      "\n",
      "So, the number of tweets with ProfileLocation = \"Boulder, Colorado, United States\" in a large sample of tweets will be described by the Poisson distribution with mean $\\nu$. \n",
      "\n",
      "The Poisson distribution is given by:\n",
      "\n",
      "$P(k;\\nu) = \\frac{\\nu^k e^{-\\nu}}{k!}$\n",
      "\n",
      "Let's look at the shape of the distribution:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "nu = 3 # Poisson mean...the most likely value\n",
      "n = nu*3 # this simply sets a range; it is not a parameter of the distribution\n",
      "\n",
      "k = scipy.linspace(0,n,n+1) # create bins\n",
      "pmf = scipy.stats.poisson.pmf(k,nu) \n",
      "\n",
      "plt.bar(k,pmf)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Things to note:  \n",
      "\n",
      "* the distribution is a function of a discrete variable\n",
      "* the distribution has one continuous parameter\n",
      "* the distribution is asymmetric for small values of $\\nu$\n",
      "* the distribution becomes symmetric for large values of $\\nu$\n",
      "\n",
      "Conclusion: the Poisson distribution can be expected to model the number of observations of a rare event in a large set of trials. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Hypothesis Tests and p-values"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Digresion on the Scientific method \n",
      "\n",
      "The scientific method does not prove that a hypothesis is true/correct. Rather, hypotheses are disproved by experimental results that are inconsistent with the hypothesis.\n",
      "\n",
      "In many fields of science, consistency or inconsistency is a discrete state. For systems where the statistics come into play, things are more complicated. We must define what _level of inconsistency_ constitutes the disproval of a hypothesis.\n",
      "\n",
      "## Hypothesis forms\n",
      "\n",
      "To quantitatively test a hypothesis, we need a quantitative hypothesis, and some data.\n",
      "\n",
      "Example hypotheses:  \n",
      "\n",
      "* the rate of tweets with ProfileLocation = \"Boulder, Colorado, United States\" is Poisson distributed with a mean of 4.5 / hour. \n",
      "* the rate of tweets with ProfileLocation = \"Boulder, Colorado, United States\" is Poisson distributed with a mean of 4.5 / 10k tweets. \n",
      "* the number of Twitter users with more than 1000 @-mentions in the past hour is normally distributed with mean 301 and variance 200.\n",
      "\n",
      "**Complication:** note that the hypotheses specify the form of the data distribution, as well as numerical parameters of those distributions. So this is in some sense a composite, conditional hypothesis: we assume both the shape of the distribution and particular values for the parameters. \n",
      "\n",
      "**Simplification:** For now, we'll assume that our _form_ is a well-founded hypothesis, and will not be tested. We will test hypotheses that are defined by the values of the parameters.\n",
      "\n",
      "Now look at some data..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# define a function that counts the number of \"Boulder, Colorado, United States\" entries \n",
      "# in a file containing ProfileLocation strings \n",
      "def count_boulders(file_name):\n",
      "    f = open(file_name)\n",
      "    boulder_counter = 0\n",
      "    for line in f:\n",
      "        if 'Boulder, Colorado, United States' in line:\n",
      "            boulder_counter += 1\n",
      "    return boulder_counter"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The files 'locations.X.txt', where 1<=X<=12, each contain the ProfileLocation strings from 10k tweets. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print('{} tweets from Boulder'.format(count_boulders('locations.1.txt')))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This value seems like a good first hypothesis. When put into a Poisson distribution, our hypothesis for the distribution of the number of Boulder tweets in 10k tweets is:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "nu = count_boulders('locations.1.txt') # Poisson mean...the most likely value\n",
      "n = nu*3 # this simply sets a range; it is not a parameter of the distribution\n",
      "\n",
      "k = scipy.linspace(0,n,n+1) # create bins\n",
      "pmf = scipy.stats.poisson.pmf(k,nu) #pmf = probability mass function\n",
      "plt.bar(k,pmf)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The distribution is normalized to unit area, so that probabilities for individual bins can be read off the plot. The probability of observing 0 Boulder tweets in 10k is ~5%, while observing 5 is about 10% probable.\n",
      "\n",
      "## Hypothesis tests and the p-value\n",
      "\n",
      "The p-value is defined for a hypothesis and a single experimental result. It is the total probablility, under the given hypothesis, of observing a result as likely or less likely than the actual observation. In its simplest form, this corresponds to integrating the tail of the distribution at and beyond the observed value."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import Image\n",
      "Image(filename='p_val.jpeg')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In many fields of science, the observation of a result with a p-value of less than 0.05 is considered a disproval of the null hypothesis.\n",
      "\n",
      "Next: define a function that returns the p-value for a Poisson distribution with mean $\\nu$ and observed value $x$."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def p_value_poisson(nu,x_obs):\n",
      "    dist = scipy.stats.poisson(nu)\n",
      "    return 1-dist.cdf(x_obs-1) # cdf = cumulative distribution function...integrates from 0 up to x"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now test some values:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(p_value_poisson(5,10)) # probability of observing 10,11,12... when the true mean is 5\n",
      "print(p_value_poisson(4,5)) # probability of observing 5,6,7,8... when the true mean is 4"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Is this a good hypothesis test? Let's use the mean estimated from 'locations.1.txt', and test the p-value of 'locations.2.txt'."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print('{} boulder tweets in {}'.format(count_boulders('locations.1.txt'),'locations.1.txt'))\n",
      "print('{} boulder tweets in {}'.format(count_boulders('locations.2.txt'),'locations.2.txt'))\n",
      "\n",
      "p_value_poisson(count_boulders('locations.1.txt'),count_boulders('locations.2.txt'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This says that, under our hypothesis of $\\nu=4$, there is a 1% probability of observing a result like or less likely than that observed in 'locations.2.txt'. Is our hypothesis disproved? \n",
      "\n",
      "<br>\n",
      "\n",
      "Let's look at the of p-values for an ensemble of 10k sets of tweets. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for file_number in range(1,12):\n",
      "    file_name = 'locations.' + str(file_number) + '.txt'\n",
      "    print('{0} boulder tweets -> p_value: {1:0.3f}'.format(\n",
      "            count_boulders(file_name),\n",
      "            p_value_poisson(\n",
      "                count_boulders('locations.1.txt'),\n",
      "                count_boulders(file_name)\n",
      "            )\n",
      "        )\n",
      "    )\n",
      "    "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It turns out that we based our hypothesis on a relatively uncommon frequency of Boulder tweets. And we tested this hypothesis with a set of data from 'locations.2.txt' which turn out to be even more uncommon.\n",
      "\n",
      "Let's make a few improvements:   \n",
      "\n",
      "* get slightly larger numbers by counting tweets from Boulder in 50k tweet buckets\n",
      "* calculate the $\\nu$ value for our hypothesis by averaging the $\\nu$'s from the first 10 buckets.\n",
      "* calculate the $p$-values for the remaining buckets\n",
      "\n",
      "\n",
      "These data can be found in 'num_boulder_50k.txt', where each line gives the number of tweets from Boulder in a 50k tweet bucket."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "line_counter = 0\n",
      "count_values = [] # array of number of Boulder tweets\n",
      "p_values = [] # array of p-values\n",
      "sum_of_boulder_tweets = 0 # used to get average number of Boulder tweets from first 10 values\n",
      "for line in open('num_boulder_50k.txt'):\n",
      "    if line_counter < 10:\n",
      "        sum_of_boulder_tweets += float(line)\n",
      "    else:\n",
      "        mean = sum_of_boulder_tweets/10\n",
      "        count_values.append(int(line))\n",
      "        p_values.append( p_value_poisson(mean,int(line)) )\n",
      "    line_counter += 1\n",
      "    \n",
      "print('\\nestimate for nu is {}\\n'.format(mean))\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# construct Poisson distribution and overlay it with the observed data\n",
      "n = 10\n",
      "k = scipy.linspace(0,n,n+1) # create bins\n",
      "pmf = scipy.stats.poisson.pmf(k,mean) # use 'mean' from above\n",
      "plt.bar(k,pmf,color='yellow')\n",
      "tmp = plt.hist(count_values, n, range = (0,n), normed=True, histtype='step')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# now show the p-values\n",
      "print('printing {} p-values:\\n'.format(len(p_values)))\n",
      "for val in sorted(p_values):\n",
      "    print('{0:0.3f}'.format(val))\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Even with a hypothesis that we know represents the data relatively well, we still occasionally get small p-values. I hope this makes you uncomfortable.\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Some thoughts to think\n",
      "\n",
      "We've made some assumptions:\n",
      "\n",
      "The value of $\\nu$ is not changing over the time period during which our samples were taken.    \n",
      "\n",
      "* we control this effect by ensuring that all tweets are actually from hour 00 on Mondays in May 2014\n",
      "* if you can't or don't control for this potential bias, your probability for observing very small p-values will increase\n",
      "\n",
      "We could also use a more sophisticated estimation method for $\\nu$, but this won't significantly change the hypothesis  \n",
      "\n",
      "* the problem is not the hypothesis, it's the hypothesis test\n",
      "\n",
      "For a sufficiently large number of results, there will ALWAYS be some with arbitrarily small p-values. So how do we disprove a hypothesis?\n",
      "\n",
      "* some fields of science define disproval with very small p-values (10^-7)\n",
      "\n",
      "We can account for multiple trials with the look-elsewhere effect:  \n",
      "\n",
      "* this is a correction factor to the p-values which account the fact that looking in lots of places leads to improbable results\n",
      "* but it assumes the trials are completely independant\n",
      "\n",
      "**Question:** Instead of performing a series of experiments and making an independant analysis of each result, wouldn't it make more sense to use the output of each trial to influence subsequent analysis results? \n",
      "\n",
      "**Answer:** That sounds awfully Bayesian..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}